Abstract


Reach into a box containing $m$ balls and pull out a geometric($p$)-sized sample.
Then put the balls back into the box and sample again. Let $X$ be the number of samples
needed to see all $m$ balls. We derive nonrecursive approximation formulas for the mean
and standard deviation of $X$.


1.  Introduction 

A box contains $m$ identical white balls. Let $K_1, K_2, \ldots$ be independent geometric($p$)
random variables, so that $P\{K_i = k\} = (1-p)^{k-1}p = q^{k-1}p$ for $k = 1, 2, \ldots$, where $q = 1 - p$. We sample
repeatedly as follows. Sample $K_1 \wedge m$ balls from the box without replacement, paint them
red, and then return them to the box. Then sample $K_2 \wedge m$ balls from the box without
replacement, paint them red, and put them back into the box, etc. We wish to determine
the mean and variance of $X$, the number of samples needed to paint all the balls red.
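
The sampling scheme just described is easy to simulate directly. The following Python sketch is only an illustration and is not part of the original derivation; the function names `geometric` and `simulate_X`, the seed, and the run count are hypothetical choices, and each sample size is truncated at $m$ as in the $K_i \wedge m$ rule above.

```python
import random

def geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

def simulate_X(m, p, rng):
    """Number of samples needed until all m balls have been painted red."""
    painted = set()
    samples = 0
    while len(painted) < m:
        size = min(geometric(p, rng), m)             # sample size K_i ^ m
        painted.update(rng.sample(range(m), size))   # draw without replacement
        samples += 1
    return samples

rng = random.Random(0)
runs = [simulate_X(10, 0.5, rng) for _ in range(20000)]
mean_X = sum(runs) / len(runs)
print(round(mean_X, 2))
```

A Monte Carlo average of this kind can be compared with the exact and approximate values of $EX$ discussed later; it carries sampling error of order (standard deviation)/$\sqrt{\text{runs}}$.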

As  was  pointed  out  to  us  by  Larry  Shepp,  it  is  straightforward  to  derive  exact  recursive 
formulas  for  the  mean  and  variance  of  the  remaining  number  of  samples  needed  to  paint 
all  the  balls  red  when  j  of  the  m  balls  are  still  white. 

The focus of this paper, however, will be to derive good nonrecursive approximations
for $EX$ and the standard deviation $\sigma_X$. Our main results are given in Propositions 3.9 (for
$EX$) and 4.18 (for $\sigma_X$). In Section 5, exact values for $EX$ and $\sigma_X$ computed using Shepp's
recursions will be compared with our approximations for several values of $m$ ranging from
10 to 300 for $p = 1/2$.

2. The (Z, W) process

Our arguments will relate the original sampling-and-painting process described in
Section 1 to the following alternative process. Let $Z_1, Z_2, \ldots$ be independent random
variables uniformly distributed on $\{1, 2, \ldots, m\}$. Let $W_1, W_2, \ldots$ be Bernoulli($p$) random
variables, independent of each other and of the $Z_i$'s. The iid sequence $\{(Z_i, W_i)\}_{i=1}^{\infty}$ will
be referred to as the $(Z, W)$ process.

The sampling-and-painting process of Section 1 can be constructed from the $(Z, W)$
process. Suppose the $m$ balls in the box are numbered $1, 2, \ldots, m$. The $Z_i$'s can be thought
of as the result of drawing balls from the box one at a time, with replacement. Suppose
also that after each ball draw we flip a coin which has probability $p$ of coming up heads.
We may think of $W_i$ as being the indicator of the event that the $i$th coin toss was heads.

Sampling $k$ balls without replacement can of course be done by drawing balls one at
a time, with replacement, and ignoring draws of previously drawn balls until $k$ distinct
balls have been obtained. Following this approach, define the first sample to be those balls
drawn before the first “counted” head, where ball draws and the associated coin flips are
“counted” only if the ball drawn does not repeat a previous draw (of the current sample).
To get the second sample, we start anew according to the same rules after the first sample
has been completed. The process of generating the first sample usually ends with the first
counted head. However, if the coin flips following the first $m$ counted draws are all tails
(corresponding to the event $\{K_1 > m\}$), then the process of generating the first sample
ends after the $m$th counted draw (and the first sample contains all $m$ balls).
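
The construction of one sample from the $(Z, W)$ process can be sketched in code. This is a minimal illustration under the rules stated above, not the authors' implementation; the function name `one_sample_from_zw` and the simulation settings are hypothetical.

```python
import random

def one_sample_from_zw(m, p, rng):
    """Build one sample from successive (Z, W) pairs.

    Returns (sample, draws): the set of distinct balls in the sample and the
    number of single-ball draws (counted and uncounted) used to build it.
    """
    sample = set()
    draws = 0
    while True:
        z = rng.randrange(1, m + 1)   # Z: uniform on {1, ..., m}
        w = rng.random() < p          # W: Bernoulli(p) coin flip
        draws += 1
        if z in sample:
            continue                  # repeat within this sample: uncounted
        sample.add(z)                 # counted draw
        if w or len(sample) == m:     # counted head, or all m balls drawn
            return sample, draws

rng = random.Random(1)
m, p = 10, 0.5
results = [one_sample_from_zw(m, p, rng) for _ in range(20000)]
mean_size = sum(len(s) for s, _ in results) / len(results)
mean_draws = sum(d for _, d in results) / len(results)
print(round(mean_size, 3), round(mean_draws, 3))
```

The sample size produced this way has the distribution of a geometric($p$) variable truncated at $m$, so `mean_size` should be close to $(1 - q^m)/p$.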
3. The expectation of X

Let $D_i$ be the number of single-ball draws (counted and uncounted) needed to generate
the $i$th sample. Note that the $D_i$'s are iid. Let $\mathcal{F}_i$ be the $\sigma$-field generated by all
observations made while generating the first $i$ samples. Note that $X$ is an $\{\mathcal{F}_i\}$ stopping
time. The total number of draws needed to generate the first $X$ samples is $\sum_{i=1}^{X} D_i$. Let
$q = 1 - p$.

Lemma 3.1 $E\left(\sum_{i=1}^{X} D_i\right) = (ED_1)\,EX$.

Proof: Wald's identity. □


Lemma 3.2

$$ED_1 = \sum_{i=0}^{m-1} \frac{m}{m-i}\, q^{i}.$$

Proof

If $K_1 \wedge m > i$, then the number of draws needed to obtain the $(i+1)$st distinct ball
after $i$ distinct balls have already been obtained is a geometric($\frac{m-i}{m}$) random variable with
mean $\frac{m}{m-i}$. Thus,

(3.3)  $$E(D_1 \mid K_1) = \sum_{i=0}^{m-1} \frac{m}{m-i}\, I(K_1 > i).$$

Now take expectations in (3.3), using $P\{K_1 > i\} = q^{i}$. □
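
As a quick numerical companion to Lemma 3.2, the sum $\sum_{i=0}^{m-1} \frac{m}{m-i} q^i$ can be evaluated directly. The helper name `expected_draws_per_sample` is hypothetical and the snippet is only a sketch.

```python
def expected_draws_per_sample(m, p):
    """ED_1 = sum_{i=0}^{m-1} m/(m-i) * q^i, with q = 1 - p (Lemma 3.2)."""
    q = 1.0 - p
    return sum(m / (m - i) * q ** i for i in range(m))

ed1 = expected_draws_per_sample(10, 0.5)
print(f"{ed1:.6f}")
```

The $i = 0$ term is always 1, reflecting the fact that every sample needs at least one draw.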

Let $\tau$ be the number of single-ball draws needed to obtain every ball at least once.

Lemma 3.4

$$E\tau = \sum_{i=0}^{m-1} \frac{m}{m-i} = m \sum_{k=1}^{m} \frac{1}{k},$$

$$\mathrm{var}(\tau) = \sum_{i=0}^{m-1} \frac{i}{m}\left(\frac{m}{m-i}\right)^{2} = m^{2} \sum_{k=1}^{m} \frac{1}{k^{2}} - m \sum_{k=1}^{m} \frac{1}{k}.$$

Proof

This situation is sometimes called the coupon collector problem. As in Lemma 3.2, $\tau$
is a sum of geometric($\frac{m-i}{m}$) random variables, here with $i = 0, 1, \ldots, m-1$. Adding the
expectations gives the formula for $E\tau$. These geometric random variables are independent,
so adding their variances gives the formula for var($\tau$). □
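
The two forms of each coupon collector formula can be checked against each other numerically. The sketch below is illustrative only; the function names are hypothetical, and the variance of a geometric variable with success probability $s$ is taken to be $(1-s)/s^2$.

```python
def coupon_collector_moments(m):
    """Mean and variance of tau as sums over the m geometric waiting times."""
    mean = sum(m / (m - i) for i in range(m))
    # geometric with success probability s = (m-i)/m has variance (1-s)/s^2
    var = sum((i / m) * (m / (m - i)) ** 2 for i in range(m))
    return mean, var

def coupon_collector_closed_form(m):
    """Same quantities written with harmonic-type sums."""
    h1 = sum(1.0 / k for k in range(1, m + 1))
    h2 = sum(1.0 / k ** 2 for k in range(1, m + 1))
    return m * h1, m * m * h2 - m * h1

print(coupon_collector_moments(10))
print(coupon_collector_closed_form(10))
```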

Now we need to relate $E\left(\sum_{i=1}^{X} D_i\right)$ to $E\tau$. The sum $\sum_{i=1}^{X} D_i$ and $\tau$ are of course equal
except for the “overshoot” given by the number of draws needed to complete the last
sample after the $\tau$th draw. Let $V$ be the size of the overshoot, so that

$$V = \Big(\sum_{i=1}^{X} D_i\Big) - \tau.$$
From Lemmas 3.1, 3.2, and 3.4,

(3.5)  $$EX = \frac{E\sum_{i=1}^{X} D_i}{ED_1} = \frac{E\tau + EV}{ED_1} = \frac{m\sum_{k=1}^{m}\frac{1}{k} + EV}{\sum_{i=0}^{m-1}\frac{m}{m-i}\,q^{i}}.$$

Let $J$ be the number of distinct balls in the last sample after the $\tau$th draw. (The $\tau$th draw
is of course “counted” since the ball drawn was by definition never drawn before. The $\tau$th
draw is the $J$th counted draw in the $X$th sample.) The expectation of $V$ is easy to find in
terms of the distribution of $J$.


Lemma 3.6

$$EV = \sum_{j=1}^{m-1} P\{J = j\} \sum_{i=j}^{m-1} \frac{m}{m-i}\, q^{i-j+1}.$$

Proof

The argument is again like the coupon collector argument used for Lemmas 3.2 and
3.4. Given $J = j$ and $K_X = k$ (with $j \le k$, necessarily), $V$ is a sum of geometric($\frac{m-i}{m}$)
random variables for $j \le i < k$. Thus,

(3.7)  $$E(V \mid J, K_X) = \sum_{i=1}^{m-1} \frac{m}{m-i}\, I(J \le i < K_X).$$

But given $J = j$, $K_X - J + 1$ is a geometric($p$) random variable, by the memoryless property
of geometric distributions. Thus, taking expectations with respect to $K_X$ in (3.7) yields

$$E(V \mid J) = \sum_{i=1}^{m-1} \frac{m}{m-i}\, q^{i-J+1}\, I(J \le i).$$

Taking expectations on both sides yields Lemma 3.6. □

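Lemma 3.8

$$EV = \frac{q}{p} - \frac{1}{p}\sum_{j=1}^{m} P\{J = j\}\, q^{m-j+1} + \sum_{j=1}^{m-1} P\{J = j\} \sum_{i=j}^{m-1} \frac{i}{m-i}\, q^{i-j+1}.$$

Proof: Straightforward algebra, writing $\frac{m}{m-i} = 1 + \frac{i}{m-i}$ in Lemma 3.6. □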

Remark The number of additional “counted” draws needed to complete the last sample
after the $\tau$th draw is, except for truncation, a “number of failures” geometric random
variable with success probability $p$ and therefore with mean $q/p$. This is where the first term
in Lemma 3.8 comes from. The second term reflects the truncation due to the fact that
no sample can contain more than $m$ balls. The last term in Lemma 3.8 is the expected
number of “uncounted” (i.e., repeat) draws obtained in the course of completing the $X$th
sample after the $\tau$th draw.

When $m$ is large, uncounted draws should be unusual. Assuming that $J$ is seldom
very large, the last term in Lemma 3.8 looks like it should be of order $m^{-1}$ for large $m$,
and the second term looks like it might be of order $q^{m}$.

Proposition 3.9 As $m \to \infty$ with $p$ fixed,

$$EX = \frac{\sum_{i=0}^{m-1} \frac{1}{m-i} + \frac{p}{1-q^{m}} \sum_{i=1}^{m-1} \frac{i\,q^{i}}{m-i}}{\sum_{i=0}^{m-1} \frac{q^{i}}{m-i}} + O(m^{-2}).$$

Proof

Let $N_X$ be the number of “virgin” balls in the $X$th sample which did not appear
in any previous sample. For large $m$, $N_X$ should equal 1 except on a set of probability
$O(m^{-1})$. The formula in Proposition 3.9 is based on approximating the distribution of $J$
by the distribution of $J$ conditioned on $N_X = 1$.

Lemma 3.10 For $n = 1, 2, \ldots, m$, and $n \le j \le m$,

$$P\{J = j \mid N_X = n\} = \frac{\binom{j-1}{n-1}\, q^{j-1}}{\sum_{i=n}^{m} \binom{i-1}{n-1}\, q^{i-1}}.$$

Proof of Lemma 3.10

Suppose there are $n$ “virgin” balls which have not been seen yet. As far as $J$ is
concerned, conditioning on $N_X = n$ here is the same as conditioning on the next sample
containing all of the remaining virgin balls.

One way to obtain the next sample is to draw all the balls from the box, one by one
and without replacement, while flipping a $p$-coin after each draw. The sample ends with
the first heads. The probability that the last virgin ball is drawn on the $j$th draw here is
proportional to $\binom{j-1}{n-1}$. The probability that the first $j$ balls all get into the sample is $q^{j-1}$.
Lemma 3.10 follows. □
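
The conditional distribution in Lemma 3.10 is simple to tabulate. The sketch below assumes the form "binomial coefficient times $q^{j-1}$, normalized over $n \le j \le m$" described in the proof; the function name `pmf_J_given_N` is hypothetical.

```python
from math import comb

def pmf_J_given_N(m, p, n):
    """P{J = j | N_X = n} for j = n, ..., m, as a dict keyed by j."""
    q = 1.0 - p
    # weight for j: C(j-1, n-1) placements of the last virgin ball, times q^(j-1)
    weights = {j: comb(j - 1, n - 1) * q ** (j - 1) for j in range(n, m + 1)}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

pmf1 = pmf_J_given_N(10, 0.5, 1)
print({j: round(v, 4) for j, v in pmf1.items()})
```

For $n = 1$ the binomial coefficient is identically 1, so the distribution reduces to a geometric($p$) distribution truncated to $\{1, \ldots, m\}$.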

Continuation of Proof of Proposition 3.9

For $n = 1$ in Lemma 3.10,

(3.11)  $$P\{J = j \mid N_X = 1\} = \frac{q^{j-1} p}{1 - q^{m}}, \qquad 1 \le j \le m.$$

Substituting $pq^{j-1}/(1 - q^{m})$ for $P\{J = j\}$ in Lemma 3.6 and then substituting the result
into (3.5) for $EV$ produces the formula in Proposition 3.9 (after cancellation of $m$'s in the
numerator and denominator). To finish the proof, we need to show that the substitution
of $P\{J = j \mid N_X = 1\}$ for $P\{J = j\}$ in Lemma 3.6 (or, equivalently, in Lemma 3.8) causes
a change which is $O(m^{-2})$. (Recall that $D_1 \ge 1$, so that the $ED_1$ in the denominator of
(3.5) is greater than 1.)

By using the moment generating function of $\tau$ and the Markov inequality, one can
show that

(3.12)  $$P\{\tau > m^{2}\} \le 2m^{1/2}e^{-m/2}.$$

(See Lemma 3.23 of Sellke (1992).) Since $X \le \tau$,

(3.13)  $$P\{X > m^{2}\} \le 2m^{1/2}e^{-m/2},$$

and therefore

(3.14)  $$P\{K_X > m/2\} \le P\{\max_{i \le X} K_i > m/2\} \le m^{2} P\{K_1 > m/2\} + P\{X > m^{2}\} \le m^{2} q^{m/2-1} + 2m^{1/2}e^{-m/2}.$$

It follows from (3.14) and $J \le K_X$ that the second term in Lemma 3.8 is $o(m^{-2})$. It
is obvious from (3.11) that this second term is still $o(m^{-2})$ if $P\{J = j\}$ is replaced by
$P\{J = j \mid N_X = 1\}$.

Now consider the situation when the second-to-last virgin ball has just been drawn.
What is the probability that the last virgin ball is drawn in the same sample as the second-to-last?
As long as the size of the current sample is less than $m/2$, the (conditional, given the
past) probability that the next counted draw will be the last virgin ball is less than $2/m$.
Each counted draw (including the one on which the second-to-last virgin ball was drawn)
of course has probability $p$ of ending the current sample. It follows that

(3.15)  $$P\{N_X > 1 \text{ and } K_X \le m/2\} \le \frac{2/m}{p + 2/m} \le \frac{2}{pm}.$$

Combining (3.14) and (3.15) yields

(3.16)  $$P\{N_X > 1\} \le \frac{2}{pm} + m^{2} q^{m/2-1} + 2m^{1/2}e^{-m/2}.$$

A similar argument shows

(3.17)  $$P\{N_X > 3\} \le \frac{C}{p^{3}m^{3}} + m^{2} q^{m/2-1} + 2m^{1/2}e^{-m/2}$$

for a numerical constant $C$.
Now write the third term of Lemma 3.8 as

(3.18)  $$\sum_{n=1}^{m} P\{N_X = n\} \sum_{j=1}^{m-1} P\{J = j \mid N_X = n\} \sum_{i=j}^{m-1} \frac{i}{m-i}\, q^{i-j+1}.$$

By (3.17), the contributions in (3.18) for $n > 3$ are collectively $O(m^{-2})$. By Lemma 3.10
and (3.16) the contributions for $n = 2$ and $n = 3$ are also $O(m^{-2})$. Finally, (3.11) and
(3.16) imply that replacing $P\{N_X = 1\}$ by 1 in (3.18) changes the sum by $O(m^{-2})$. Putting
all this together establishes Proposition 3.9. □
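
The approximation of Proposition 3.9 can be evaluated with a few lines of code. The sketch below assumes the formula obtained by substituting $pq^{j-1}/(1-q^m)$ for $P\{J = j\}$ in Lemma 3.6 and then into (3.5), with the double sum over $j \le i$ collapsed to $\sum_{i} i\,q^i/(m-i)$; the function name `approx_EX` is hypothetical.

```python
def approx_EX(m, p):
    """Nonrecursive approximation to EX (leading term of Proposition 3.9)."""
    q = 1.0 - p
    harmonic = sum(1.0 / (m - i) for i in range(m))
    # overshoot term: EV / m with P{J = j} replaced by p q^(j-1) / (1 - q^m)
    overshoot = p / (1.0 - q ** m) * sum(i * q ** i / (m - i) for i in range(1, m))
    denominator = sum(q ** i / (m - i) for i in range(m))   # ED_1 / m
    return (harmonic + overshoot) / denominator

for m in (20, 50, 100):
    print(m, round(approx_EX(m, 0.5), 4))
```

The proposition only bounds the error by $O(m^{-2})$, so agreement with exact values should improve as $m$ grows and need not be close for small $m$.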


Remark  Sellke  (1992)  uses  a  more  complicated  Markov-chain  coupling  argument  to  derive 
the  approximation 

a*'1 — — 

PV-ii*'.—*'. 


E 


i=0 


which approximates $P\{J = j\}$ much better than does (3.10) above.
The resulting approximation


EX 


m 

Ei 

*=1 


m—1 

£ 

r=l 


y  -i -Qr  y  — i 

*  m— r  *  m— j 


>+l 


m—1 

£  -W 

'  m— \  » 
»=0 


IE' 


i=0 


is shown in Sellke (1992) to have an approximation error which converges to 0 exponentially
fast (in $m$).

4.  The  variance  and  standard  deviation  of  X 

As was the case with $EX$ in Section 3, we will approximate var($X$) by exploiting the
relationship with the $(Z, W)$ process. The trick here will be to define a new and different
sampling scheme for which samples end either when a counted flip produces a heads or
when a “virgin” ball is drawn. (Again, a “virgin” ball is one which has never been drawn
before.) With this new sampling scheme, the number $\tilde{X}$ of samples needed to see all the
balls is a sum of independent geometric random variables, so the variance of $\tilde{X}$ is the sum
of the variances of the summand geometric random variables. It then remains to estimate
var($\tilde{X}$) $-$ var($X$).

Let $\{(Z_n, \tilde{W}_n)\}_{n=1}^{\infty}$ be exactly the same as the $\{(Z_n, W_n)\}_{n=1}^{\infty}$ process, except that
$\tilde{W}_n$ is set equal to 1 each time the corresponding $Z_n$ is different from all previous $Z_i$'s,
$i < n$. (Otherwise, $\tilde{W}_n = W_n$.) In terms of ball draws and coin flips, the story is just as
before, except that we don't see the results of coin flips following draws of virgin balls. We
pretend that the unseen flips are all heads. Now define a “repeated sampling process” in
terms of the $(Z, \tilde{W})$ process according to the same rules applied in Section 2 to the $(Z, W)$
process. Again draws and flips are “counted” only when the ball drawn does not repeat
a previous ball of the current sample, and $\tilde{W}_n = 1$ for a “counted” flip signals the end of
the current sample. When there is danger of ambiguity, these samples will be referred to as
abbreviated samples, since they are abbreviated by the draws of virgin balls. Let $\tilde{X}$ be the
number of abbreviated samples needed to see all the balls.

For $k = 0, 1, \ldots, m-1$, let $\pi_k$ be the probability that the next abbreviated sample
contains a virgin ball when exactly $k$ balls have been seen already (so that there are $m - k$
virgin balls left).

Lemma 4.1 For $k = 0, 1, \ldots, m-1$,

$$\pi_k = E\,\frac{m-k}{m-T_k},$$

where $T_k$ is a binomial($k, q$) random variable. (As before, $q = 1 - p$.)

Proof 

Think of the next abbreviated sample as being obtained as follows. For each of the
$k$ nonvirgin balls, we flip the $p$-coin once, before drawing any more balls. Each nonvirgin
ball becomes a “heads” ball or a “tails” ball, depending on the result of the coin flip
corresponding to that ball. Let $T_k$ be the number of tails among the $k$ flips, so that $T_k \sim$
binomial($k, q$). Then we draw balls one at a time, without replacement, until we get either
a virgin ball or a nonvirgin “heads” ball, either of which ends the current sample. This
sampling protocol produces proper abbreviated samples, since it doesn't matter whether
the coin flipping is done while drawing the balls or before drawing the balls.

Since there are $m - k$ virgin balls among the $m - T_k$ balls which will terminate the
sample, the conditional probability, given $T_k$, that a virgin ball terminates the sample is
$(m - k)/(m - T_k)$. Taking the expectation gives the unconditional probability that the
sample contains a virgin ball. □
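
Lemma 4.1 gives $\pi_k$ as a finite expectation over a binomial distribution, so it can be computed exactly for moderate $m$. The following sketch is illustrative; the names `binom_pmf` and `pi_exact` are hypothetical, and $0 < p < 1$ is assumed so that the logarithms are defined.

```python
from math import exp, lgamma, log

def binom_pmf(k, t, q):
    """P{T = t} for T ~ binomial(k, q), computed in log space (0 < q < 1)."""
    log_pmf = (lgamma(k + 1) - lgamma(t + 1) - lgamma(k - t + 1)
               + t * log(q) + (k - t) * log(1.0 - q))
    return exp(log_pmf)

def pi_exact(m, k, p):
    """pi_k = E[(m - k) / (m - T_k)] with T_k ~ binomial(k, q), q = 1 - p."""
    q = 1.0 - p
    return sum(binom_pmf(k, t, q) * (m - k) / (m - t) for t in range(k + 1))

print(pi_exact(2, 1, 0.5))
print([round(pi_exact(10, k, 0.5), 4) for k in range(10)])
```

For $m = 2$, $k = 1$ the value can be checked by hand: the next counted draw is the virgin ball with probability $1/2$, and otherwise the sample continues to the virgin ball only if the flip is tails, giving $1/2 + q/2$.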

Lemma 4.2 For $k = 0, 1, \ldots, m-1$ and any $n = 2, 3, \ldots$,

$$\pi_k = \frac{m-k}{m-kq}\left\{1 + E\Big(\frac{T_k - kq}{m-kq}\Big)^{2} + \cdots + E\Big(\frac{T_k - kq}{m-kq}\Big)^{n} + E\Big[\Big(\frac{T_k - kq}{m-kq}\Big)^{n+1} \frac{m-kq}{m-T_k}\Big]\right\}.$$

Proof

Note that by algebra

$$\frac{m-k}{m-T_k} = \frac{m-k}{m-kq} \cdot \frac{1}{1 - \frac{T_k - kq}{m-kq}}.$$

Applying

$$\frac{1}{1-x} = 1 + x + \cdots + x^{n} + \frac{x^{n+1}}{1-x}$$

to the last factor leads to

$$\frac{m-k}{m-T_k} = \frac{m-k}{m-kq}\left\{1 + \frac{T_k - kq}{m-kq} + \cdots + \Big(\frac{T_k - kq}{m-kq}\Big)^{n}\right\} + \Big(\frac{T_k - kq}{m-kq}\Big)^{n+1} \frac{m-k}{m-T_k}.$$

Now take expectations, noting that $E(T_k - kq) = 0$. □

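Lemma 4.3 The relative error committed in approximating $\pi_k$ and also $1 - \pi_k$ using

$$\tilde{\pi}_k = \frac{m-k}{m-kq}\left\{1 + \frac{kpq}{(m-kq)^{2}} + \frac{kpq(p-q)}{(m-kq)^{3}} + \frac{3k^{2}p^{2}q^{2}}{(m-kq)^{4}}\right\}$$

is $O(m^{-3})$, uniformly in $0 \le k < m$ and in $p$ bounded away from 0.

Proof

Take $n = 7$ in Lemma 4.2, and use the formulas for central moments of binomials
given in Kendall and Stuart (1969), pages 121-3. The second and third central moments of $T_k$ are $kpq$ and $kpq(p-q)$, and the leading part of the fourth is $3k^{2}p^{2}q^{2}$. □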
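
The quality of a moment expansion of this kind can be checked numerically against the exact expectation of Lemma 4.1. The sketch below is an illustration under stated assumptions: it takes the expansion terms to be the binomial central moments $kpq$, $kpq(p-q)$ and the leading fourth-moment part $3k^2p^2q^2$, each divided by the matching power of $m - kq$. The names `pi_exact` and `pi_tilde` are hypothetical.

```python
from math import exp, lgamma, log

def pi_exact(m, k, p):
    """pi_k = E[(m - k)/(m - T_k)], T_k ~ binomial(k, q), q = 1 - p, 0 < p < 1."""
    q = 1.0 - p
    total = 0.0
    for t in range(k + 1):
        log_pmf = (lgamma(k + 1) - lgamma(t + 1) - lgamma(k - t + 1)
                   + t * log(q) + (k - t) * log(p))
        total += exp(log_pmf) * (m - k) / (m - t)
    return total

def pi_tilde(m, k, p):
    """Four-term moment expansion of pi_k around T_k = kq (assumed form)."""
    q = 1.0 - p
    b = m - k * q
    s2 = k * p * q                       # second central moment of T_k
    return (m - k) / b * (1.0 + s2 / b ** 2
                          + s2 * (p - q) / b ** 3
                          + 3.0 * s2 ** 2 / b ** 4)

for p in (0.3, 0.5):
    exact, approx = pi_exact(100, 50, p), pi_tilde(100, 50, p)
    print(p, exact, approx, abs(approx / exact - 1.0))
```

The printed relative errors give a rough feel for how quickly the expansion converges at moderate $m$.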

Recall that $\tau$ is the number of single-ball draws (with replacement) needed to see all
the balls. Note that $X - 1$ is the number of “counted” heads flips (based on the $(Z, W)$
process) among the first $\tau - 1$ flips. Likewise, $\tilde{X} - 1$ is the number of “counted” heads flips
(based on the $(Z, \tilde{W})$ process) among the first $\tau - 1$ flips. Let $\Delta_1$ equal the number of
$\tilde{W}_n$'s, $n < \tau$, which equal 1 when $W_n$ is 0, so that

$$\Delta_1 = \sum_{n=1}^{\tau-1} (\tilde{W}_n - W_n).$$

Let $\mathcal{G}$ be the $\sigma$-field generated by the entire $(Z, \tilde{W})$ process. Obviously, $\tilde{X}$ is $\mathcal{G}$-measurable,
while the distribution of $\Delta_1$, given $\mathcal{G}$, is binomial($m-1, q$). Thus, $\tilde{X}$ and $\Delta_1$ are
independent, and

(4.4)  $$\mathrm{var}(\tilde{X} - \Delta_1) = \mathrm{var}(\Delta_1) + \mathrm{var}(\tilde{X}) = (m-1)pq + \sum_{k=0}^{m-1} \frac{1-\pi_k}{\pi_k^{2}}.$$

Now define

$$\Delta_2 = (\tilde{X} - \Delta_1) - X,$$

so that

(4.5)  $$X = \tilde{X} - \Delta_1 - \Delta_2.$$

When the $W_n$'s are replaced by $\tilde{W}_n$'s, $\Delta_1$ tails flips among the first $\tau - 1$ flips are replaced by
heads flips. Since the corresponding ball draws all produced virgin balls, these flips were
all “counted” flips in either scheme. All “counted” heads in the original scheme (using
the $(Z, W)$ process) stay “counted” heads in the new scheme (with the $(Z, \tilde{W})$ process).
However, there may be a few ($= \Delta_2$) “uncounted” heads in the original $(Z, W)$ scheme
among the first $\tau - 1$ flips which become “counted” heads in the new $(Z, \tilde{W})$ scheme. (Recall
that a coin flip is “uncounted” when the corresponding ball drawn was drawn earlier in
the current sample.) When a “virgin tails” draw in the $(Z, W)$ process becomes a “heads”
draw in the $(Z, \tilde{W})$ process, subsequent draws in the current, unabbreviated sample which
repeated earlier balls of that sample can become non-repeat (and therefore “counted”)
draws in the $(Z, \tilde{W})$ process.

Now it remains to get a handle on the effect of $\Delta_2$ on the variance of $X = \tilde{X} - \Delta_1 - \Delta_2$.

Heuristic reasoning suggests that the variance of $\Delta_2$ should be bounded, uniformly
in $m$ for $p$ bounded away from 0. Indeed, the probability that a repeated draw occurs
in the course of generating a particular unabbreviated sample should be $O(m^{-1})$, with
multiple repeats having probability $O(m^{-2})$. Since there are $m$ or fewer (unabbreviated)
samples which contain virgin balls, the total number of repeated (= uncounted) draws for
these samples should be $O_P(1)$. Note that $\Delta_2$ is at most the number of repeated draws in the
samples containing virgin balls. Finally, it seems plausible that $\Delta_2$ should be essentially
uncorrelated with $\tilde{X} - \Delta_1$. Thus, one would guess that

(4.6)  $$\mathrm{var}\,X = \mathrm{var}(\tilde{X} - \Delta_1) + \mathrm{var}(\Delta_2) + o(1) = (m-1)pq + \sum_{k=0}^{m-1} \frac{1-\pi_k}{\pi_k^{2}} + \mathrm{var}(\Delta_2) + o(1),$$

so that

(4.7)  $$(m-1)pq + \sum_{k=0}^{m-1} \frac{1-\pi_k}{\pi_k^{2}}$$

should be an underestimate of var $X$ with uniformly bounded error for all $m$ and for $p$
bounded away from 0. We will show that var($\Delta_2$) is bounded. We will also give an
argument showing that the correlation between $\Delta_2$ and $\tilde{X} - \Delta_1$ goes to 0 as $m \to \infty$, but
the rate will not be enough to prove (4.6). However, it will be good enough to prove the
following result for the standard deviation $\sigma_X$ of $X$.

Proposition 4.8 As $m \to \infty$,

$$\sigma_X = \left\{(m-1)pq + \sum_{k=0}^{m-1} \frac{1-\pi_k}{\pi_k^{2}}\right\}^{1/2} + o(1)$$

uniformly for $p$ bounded away from 0.

Remark The error in this approximation for $\sigma_X$ remains $o(1)$ when the expression in Lemma
4.3 is used to approximate $\pi_k$. A more explicit formula for $\sigma_X$ resulting from application
of Lemma 4.3 is given in Proposition 4.18.

Lemma 4.9 For each $\varepsilon > 0$, there exists a constant $C_\varepsilon$ so that

$$\mathrm{var}(\Delta_2) \le C_\varepsilon$$

for all $m$ whenever $p \ge \varepsilon$.

Proof
Let $R$ be the number of repeat (uncounted) draws in all (unabbreviated) samples
containing virgin balls. Note again that $\Delta_2 \le R$. For $1 \le i \le m$, let $R_i$ be the number of
repeat draws in the $i$th sample containing a virgin ball, if there is an $i$th such sample. Set
$R_i = 0$ if there is no such sample. Thus, $R = R_1 + \cdots + R_m$. Let $K_{(i)}$ be the size of the
$i$th sample containing a virgin ball, with $K_{(i)} = 0$ if there is no such sample.

Let $\mathcal{G}_i$ be the $\sigma$-field generated by everything that happens up to and including the
completion of the sample containing the $i$th virgin ball. Then, given $\mathcal{G}_{i-1}$, $K_{(i)}$ is stochastically
less than or equal to $K^{*}$, where

$$P\{K^{*} = k\} = \begin{cases} (1 - q^{m})^{-1}\, k p^{2} q^{k-1} & \text{if } 1 \le k \le m-1, \\ (1 - q^{m})^{-1}\, m p q^{m-1} & \text{if } k = m, \\ 0 & \text{else.} \end{cases}$$

Indeed, the $K^{*}$ distribution is the exact distribution of $K_{(i)}$ (given $\mathcal{G}_{i-1}$) when exactly one
virgin ball remains to be drawn. When more than one virgin ball is left, the distribution
of $K_{(i)}$ is easily shown to be stochastically smaller.

The conditional distribution of $R_i$, given $\mathcal{G}_{i-1}$ and $K_{(i)} = k$, is that of a sum of
independent geometric($\frac{m-\ell}{m}$) “number of failures” random variables with $0 \le \ell < k$.
In symbols,

$$\mathcal{L}(R_i \mid \mathcal{G}_{i-1}, K_{(i)} = k) = \sum_{\ell=0}^{k-1} \mathrm{Geom}_F\Big(\frac{m-\ell}{m}\Big).$$

This is because the number of repeat draws made between the $\ell$th and $(\ell+1)$st distinct
balls of the sample is $\mathrm{Geom}_F(\frac{m-\ell}{m})$, independent of what came before. Thus,

£(S,olSi-.)  <  (1  £>‘-'  E 

k=  1  l=o 

But  there  exist  q,q  <  q  <  1,  and  B  (depending  on  £  >  0)  so  that  k‘qk~l  <  Bqk  for  all 
k  >  0  and  all  q  <  1  —  e.  Thus,  we  easily  get  the  bound  (writing  p  =  1—  q) 


(4.10) 


m  k— 1  , 

£  ^  E  «*  E 

m  9 

ti  m 
<  (Bp'2)-, 

771 


where  the  second  inequality  follows  from  (m  —  £)k  >  (m  —  k  +  l)Jfc  >  m/2 
k  <  m. 

From

$$\mathrm{var}\Big\{\mathrm{Geom}_F\Big(\frac{m-\ell}{m}\Big)\Big\} = \frac{\ell m}{(m-\ell)^{2}} \qquad \text{for } 0 \le \ell < m$$

we get


/=0  V  '  1=0 


SO 


£(*ye(0)  <  a  -  smr' £>‘-' 


k=l 


frw  f  V'  ^  1 2 

|f-  (m-^)2  +{Z^^7} 

L«=o  7  t=o 


(4.11) 


rt-i 


t-i 


-Bp  ,5,it£<m-<>2ii+{S<m-*>*2}2 

<  Bp-2(-  +  ^j)  <  (8Bp-2)-. 


m  mx 


m 


Bounds (4.10) and (4.11) imply

$$E(R^{2}) = E\{(R_1 + \cdots + R_m)^{2}\} = \sum_{i=1}^{m} E(R_i^{2}) + 2\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} E(R_i R_j) \le 8B\bar{p}^{-2} + 4B^{2}\bar{p}^{-4},$$

which proves Lemma 4.9. □

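Lemma 4.12

covariance($\tilde{X} - \Delta_1$, $\Delta_2$) $= o(m)$, uniformly in $p$ for $p \ge \varepsilon$, $\varepsilon > 0$.

Proof of Lemma 4.12

First note from (4.4) and Lemma 4.3 (or see (4.18) below) that

(4.13)  $$\mathrm{var}(\tilde{X} - \Delta_1) = m^{2}p^{2} \sum_{j=1}^{m} \frac{1}{j^{2}} + O(m \log m)$$

uniformly in $p$.

The idea here is that most of the variability in $\Delta_2$ comes from what happens early in
the sampling process, while most of the variability in $\tilde{X} - \Delta_1$ comes from what happens
later in the sampling process. For each $m \ge 3$, let $i^{*} = i^{*}(m)$ be the greatest integer less
than $m\{1 - (\log m)^{-1}\}$. Again let $\mathcal{G}_i$ be the $\sigma$-field generated by the $(Z, W)$ process up
through the completion of the sample containing the $i$th distinct (= virgin) ball. Then it
is easy to show that

$$\frac{E\{\mathrm{var}(\tilde{X} - \Delta_1 \mid \mathcal{G}_{i^{*}})\}}{\mathrm{var}(\tilde{X} - \Delta_1)} \to 1$$

as $m \to \infty$. This in turn implies

(4.14)  $$\mathrm{var}\{E(\tilde{X} - \Delta_1 \mid \mathcal{G}_{i^{*}})\} = o\{\mathrm{var}(\tilde{X} - \Delta_1)\}$$

since

$$\mathrm{var}(\tilde{X} - \Delta_1) = \mathrm{var}\{E(\tilde{X} - \Delta_1 \mid \mathcal{G}_{i^{*}})\} + E\{\mathrm{var}(\tilde{X} - \Delta_1 \mid \mathcal{G}_{i^{*}})\}.$$

On the other hand,

(4.15)  $$E\{\mathrm{var}(\Delta_2 \mid \mathcal{G}_{i^{*}})\} = o\{\mathrm{var}(\Delta_2)\},$$

since given $\mathcal{G}_{i^{*}}$, there are only about $m/\log m$ remaining samples which will contain virgin
balls.

Now write

(4.16)  $$(\tilde{X} - \Delta_1) - E(\tilde{X} - \Delta_1) = \{E(\tilde{X} - \Delta_1 \mid \mathcal{G}_{i^{*}}) - E(\tilde{X} - \Delta_1)\} + \{(\tilde{X} - \Delta_1) - E(\tilde{X} - \Delta_1 \mid \mathcal{G}_{i^{*}})\}$$

and

(4.17)  $$\Delta_2 - E(\Delta_2) = \{E(\Delta_2 \mid \mathcal{G}_{i^{*}}) - E(\Delta_2)\} + \{\Delta_2 - E(\Delta_2 \mid \mathcal{G}_{i^{*}})\}.$$

By (4.13), (4.14), (4.15), Lemma 4.9, and the Cauchy-Schwarz inequality, the expectation
of the product of (4.16) and (4.17) is $o(m)$, uniformly in $p \ge \varepsilon > 0$, which proves the
lemma. □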

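Proof of Proposition 4.8

By Lemma 4.9 and Lemma 4.12,

$$\mathrm{var}\,X = \mathrm{var}(\tilde{X} - \Delta_1 - \Delta_2) = \mathrm{var}(\tilde{X} - \Delta_1) + o(m).$$

By (4.13),

$$\sigma_X = \sigma_{\tilde{X} - \Delta_1} + o(1),$$

which together with (4.4) proves Proposition 4.8. □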
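
The approximation of Proposition 4.8 can be evaluated numerically by computing each $\pi_k$ from Lemma 4.1. The sketch below is illustrative rather than the authors' computation; the names `pi_exact` and `sigma_prop_4_8` are hypothetical, and $0 < p < 1$ is assumed.

```python
from math import exp, lgamma, log, sqrt

def pi_exact(m, k, p):
    """pi_k = E[(m - k)/(m - T_k)], T_k ~ binomial(k, q), q = 1 - p."""
    q = 1.0 - p
    total = 0.0
    for t in range(k + 1):
        log_pmf = (lgamma(k + 1) - lgamma(t + 1) - lgamma(k - t + 1)
                   + t * log(q) + (k - t) * log(p))
        total += exp(log_pmf) * (m - k) / (m - t)
    return total

def sigma_prop_4_8(m, p):
    """sqrt{(m-1)pq + sum_k (1 - pi_k)/pi_k^2}, the leading term for sigma_X."""
    q = 1.0 - p
    var = (m - 1) * p * q
    for k in range(m):
        pk = pi_exact(m, k, p)
        var += (1.0 - pk) / pk ** 2      # variance of a geometric(pi_k) count
    return sqrt(var)

for m in (10, 100, 300):
    print(m, round(sigma_prop_4_8(m, 0.5), 4))
```

Since the proposition only guarantees an $o(1)$ error, the output should be read as an approximation to $\sigma_X$ whose accuracy improves with $m$.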
Proposition 4.18

$$\sigma_X = \left\{(m^{2}p^{2} - 2mpq) \sum_{j=1}^{m} \frac{1}{j^{2}} + mp(q-p) \sum_{j=1}^{m} \frac{1}{j}\right\}^{1/2} + o(1)$$

as $m \to \infty$, uniformly for $p \ge \varepsilon > 0$.
Proof

Replacing $\pi_k$ in (4.4) by

$$\frac{m-k}{m-kq}\left\{1 + \frac{kpq}{(m-kq)^{2}}\right\}$$

and using Lemma 4.3 implies

(4.19)  $$\mathrm{var}(\tilde{X} - \Delta_1) = (m^{2}p^{2} - 2mpq) \sum_{j=1}^{m} \frac{1}{j^{2}} + mp(q-p) \sum_{j=1}^{m} \frac{1}{j} + O(\log m)$$

after some calculation. The proposition now follows from Lemmas 4.9 and 4.12 as before
in Proposition 4.8. □
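
The explicit formula of Proposition 4.18 needs only two harmonic-type sums. The sketch below assumes the braced expression $(m^2p^2 - 2mpq)\sum_{j \le m} j^{-2} + mp(q-p)\sum_{j \le m} j^{-1}$ is positive, which can fail for very small $m$ or $p$; the function name `sigma_prop_4_18` is hypothetical.

```python
from math import sqrt

def sigma_prop_4_18(m, p):
    """Leading term of the Proposition 4.18 approximation to sigma_X."""
    q = 1.0 - p
    h1 = sum(1.0 / j for j in range(1, m + 1))
    h2 = sum(1.0 / j ** 2 for j in range(1, m + 1))
    var = (m * m * p * p - 2.0 * m * p * q) * h2 + m * p * (q - p) * h1
    return sqrt(var)   # assumes var > 0

for m in (10, 100, 300):
    print(m, round(sigma_prop_4_18(m, 0.5), 4))
```

When $p = 1/2$ the second sum drops out because $q - p = 0$, which is why the approximation takes a particularly simple form in that case.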

Remark It seems likely that the covariance of $\tilde{X} - \Delta_1$ and $\Delta_2$ is uniformly bounded or
perhaps even $o(1)$ as $m \to \infty$ for $p \ge \varepsilon > 0$. Thus, from (4.19) and Lemma 4.9 we
conjecture that

(4.20)  $$\mathrm{var}(X) = (m^{2}p^{2} - 2mpq) \sum_{j=1}^{m} \frac{1}{j^{2}} + mp(q-p) \sum_{j=1}^{m} \frac{1}{j} + O(\log m)$$

uniformly for $p \ge \varepsilon > 0$.

5. Numerical results for p = 1/2

Here is how the exact values for $EX$ (with $p = 1/2$) compare with the approximation
in Proposition 3.9 for several values of $m$. (The exact values here and in the next table
were calculated from Larry Shepp's recursive formulas.)

Table 1

   m      E(X)        E(X) - (3.9)
  10     13.3812      0.05997
  20     34.5149      0.00649
  50    110.5647      0.000524
 100    257.2316      0.000113
 150    417.0115      0.000048
 200    585.3392      0.000026
 250    760.0133      0.000017
 300    939.7404      0.000012

When $p = 1/2$, the approximation for $\sigma_X$ in Proposition 4.18 equals

(5.1)  $$\frac{\pi}{\sqrt{6}} \cdot \frac{m-1}{2} - \frac{\sqrt{6}}{4\pi}$$

to within $O(m^{-1})$.

The following table shows how the exact values for $\sigma_X$ (for $p = 1/2$) compare with (5.1).

Table 2

   m      sigma_X     sigma_X - (5.1)
  10      5.8086      0.2320
  20     12.1008      0.1115
  50     31.2789      0.0514
 100     63.3212      0.0299
 150     95.3768      0.0217
 200    127.4361      0.0173
 250    159.4970      0.0145
 300    191.5588      0.0125
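
The last column of Table 2 can be recomputed from the first two. The sketch below assumes that (5.1) has the form $\frac{\pi}{\sqrt{6}} \cdot \frac{m-1}{2} - \frac{\sqrt{6}}{4\pi}$ and takes the exact $\sigma_X$ values from the table as given; the names `sigma_5_1` and `table2` are hypothetical.

```python
from math import pi, sqrt

def sigma_5_1(m):
    """Closed-form approximation (5.1) to sigma_X for p = 1/2 (assumed form)."""
    return pi / sqrt(6.0) * (m - 1) / 2.0 - sqrt(6.0) / (4.0 * pi)

# (m, exact sigma_X, reported sigma_X - (5.1)) as listed in Table 2
table2 = [
    (10, 5.8086, 0.2320), (20, 12.1008, 0.1115), (50, 31.2789, 0.0514),
    (100, 63.3212, 0.0299), (150, 95.3768, 0.0217), (200, 127.4361, 0.0173),
    (250, 159.4970, 0.0145), (300, 191.5588, 0.0125),
]

gaps = [(m, exact - sigma_5_1(m), reported) for m, exact, reported in table2]
for m, gap, reported in gaps:
    print(m, round(gap, 4), reported)
```

The recomputed gaps shrink roughly like $1/m$, consistent with the conjectured $O(\log m)$ error in the variance formula (4.20).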